In this paper we consider stochastic multiarmed bandit problems. Recently a policy, DMED, was proposed and proved to achieve the asymptotic bound for the model in which each reward distribution is supported on a known bounded interval, e.g. [0,1]. However, the derived regret bound was given only in an asymptotic form, and the finite-time performance remained unknown. We inspect this policy and derive a finite-time regret bound by refining large deviation probabilities to a simple finite form. Furthermore, this observation reveals that the assumption on the lower-boundedness of the support is not essential and can be replaced by a weaker one: the existence of the moment generating function.
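To make the setting concrete, the following is a minimal sketch of a DMED-style list-based policy specialized to Bernoulli rewards. It is an illustrative assumption, not the exact algorithm analyzed in the paper: the policy repeatedly plays every arm in a current list and then recomputes the list as those arms whose empirical KL divergence from the empirically best mean, scaled by their pull counts, does not exceed log t. The function names, the threshold `log(t)`, and the simulation parameters are all hypothetical choices for illustration.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def dmed_bernoulli(means, horizon, seed=0):
    """Run a DMED-style policy on Bernoulli arms; return pull counts per arm.

    `means` are the true (unknown to the policy) success probabilities.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    t = 0

    def pull(i):
        nonlocal t
        t += 1
        reward = 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1
        sums[i] += reward

    # Initialization: pull each arm once.
    for i in range(k):
        pull(i)

    current = list(range(k))  # list of arms scheduled to be played
    while t < horizon:
        for i in current:
            if t >= horizon:
                break
            pull(i)
        mu_hat = [sums[i] / counts[i] for i in range(k)]
        mu_star = max(mu_hat)
        # Keep arms whose accumulated empirical divergence from the best
        # empirical mean is still within the log(t) exploration budget.
        # (The empirically best arm has divergence 0, so it always stays.)
        current = [i for i in range(k)
                   if counts[i] * kl_bernoulli(mu_hat[i], mu_star) <= math.log(t)]
    return counts
```

A short simulation suggests the intended behavior: with arms of means 0.8 and 0.3, the suboptimal arm is pulled only on the order of log t / KL(0.3, 0.8) times, so the optimal arm dominates the pull counts.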